Exporting data is a crucial tool in any database user’s toolkit, and in CockroachDB the EXPORT statement has long provided this essential functionality.
Say, for example, you need to export a sizable amount of JSON to seed a core data store for a streaming service. Sometimes you can pre-seed this data alongside your application with an EXPORT. But if you’re working at the scale of a company like Netflix, EXPORT comes with limitations that led us to explore an additional way to export data.
First of all, EXPORTs have some configurability and integration limitations, but these are minor issues compared to two larger problems: EXPORT does not scale well to very large tables (think the 10+ TB range), and an EXPORT job cannot be paused and resumed, so any failure means restarting the export from scratch.
There is a solution to these problems. In CockroachDB you can export data using changefeeds (available in CockroachDB Enterprise and in our fully managed offerings, CockroachDB Dedicated and CockroachDB Serverless).
(CockroachDB Core users are invited to try an experimental feature, EXPERIMENTAL CHANGEFEED FOR. This statement creates a new changefeed that streams row-level changes to the client indefinitely, until the underlying connection is closed or the changefeed is canceled.)
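As a minimal sketch (assuming rangefeeds are enabled on the cluster, and using a hypothetical `movies` table), a Core changefeed streams changes straight back to the SQL client:

```sql
-- Changefeeds are built on rangefeeds, which must be enabled first.
SET CLUSTER SETTING kv.rangefeed.enabled = true;

-- Stream every row-level change on the hypothetical movies table to
-- this client until the connection closes or the feed is canceled.
EXPERIMENTAL CHANGEFEED FOR movies;
```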
Enterprise change data capture (CDC) provides row-level change subscriptions for downstream processing. In CockroachDB, users can create a new changefeed that streams row-level changes in a configurable format to a configurable sink, such as Apache Kafka.
```sql
CREATE CHANGEFEED FOR TABLE table
  INTO '[kafka | webhook | cloud storage | gc pubsub sink URI]'
  WITH format = csv, initial_scan = 'only';
```
The CREATE CHANGEFEED statement creates a new CockroachDB Enterprise changefeed, which targets an allowlist of tables called “watched rows”. Every change to a watched row is emitted as a record in a configurable format (JSON or Avro) to a configurable sink (Kafka, Google Cloud Pub/Sub, a cloud storage sink, or a webhook sink). You can create, pause, resume, alter, or cancel an Enterprise changefeed.
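For example, an Enterprise changefeed is managed with CockroachDB’s standard job-control statements (the job ID below is a placeholder):

```sql
-- Look up the changefeed's job ID.
SHOW CHANGEFEED JOBS;

-- Pause, resume, or cancel the changefeed by job ID (placeholder value).
PAUSE JOB 12345;
RESUME JOB 12345;
CANCEL JOB 12345;
```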
In the base case, a changefeed sends a continuous stream of a table’s updates to a SQL client or an external sink. With the initial_scan=only option, however, the changefeed emits a snapshot of the table and then the job completes.
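Putting that together, a one-shot CSV export to cloud storage might look like the following sketch (the table name and bucket URI are hypothetical):

```sql
-- Emit a single CSV snapshot of the movies table to a hypothetical
-- Google Cloud Storage bucket; the job completes after the scan.
CREATE CHANGEFEED FOR TABLE movies
  INTO 'gs://acme-exports/movies?AUTH=implicit'
  WITH format = csv, initial_scan = 'only';
```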
Using changefeeds allows us to take advantage of a wider set of integrations and configuration options, as well as a changefeed architecture that stays robust at scale. The huge advantage: in CockroachDB, changefeeds are built to scale organically with your data. Here’s how it works: the changefeed job is distributed across the nodes of the cluster, with each node emitting changes for the data it serves, so export throughput grows along with the cluster itself.
Let’s compare CREATE CHANGEFEED to a traditional EXPORT:
| | Changefeeds | EXPORT |
|---|---|---|
| CSV format | yes | yes |
| Parquet format | not yet | yes |
| Avro, JSON format | yes | no |
| Sends results of arbitrary SELECT statements | not yet | yes |
| Sends results of a SELECT * statement | yes | yes |
| Multi-table support | yes | no |
| Expect duplicate messages | yes | no |
| Send to cloud storage | yes | yes |
| Send to Kafka, GC Pub/Sub, webhook | yes | no |
| Scales to 10+ TB | yes | no |
| Requires enterprise license | yes | no |
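One row in the table deserves a concrete illustration: multi-table support means a single changefeed can export several tables in one job. A sketch, with hypothetical table names and sink URI:

```sql
-- Export two tables in a single changefeed job.
CREATE CHANGEFEED FOR TABLE movies, ratings
  INTO 'gs://acme-exports/catalog?AUTH=implicit'
  WITH format = csv, initial_scan = 'only';
```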
For more details about the differences in configurability, check out the EXPORT and Enterprise changefeed docs.
Changefeeds offer a great alternative to EXPORT for users operating at high data scale. There are still a few limitations compared to the EXPORT statement, but changefeeds are quickly approaching feature parity, and they already offer many configuration and destination options beyond what EXPORT provides.